Process and method for data assurance management by applying data assurance metrics

ABSTRACT

The present invention relates generally to methods, software and systems for measuring and valuing the quality of information and data, where such measurements and values are made and processed by implementing objectively defined, measurable, comparable and repeatable dimensions using software and complex computers. The embodiments include processes, systems and methods for identifying optimal scores of the data dimension. The invention further includes processes, systems and methods for data filtering to improve the overall data quality of a data source. Finally, the invention further includes processes, systems and methods for data quality assurance of groups of rows of a database.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. patent application Ser. No. 13/391,457, filed Feb. 21, 2012, which claims the benefit of U.S. Provisional Application No. 61/234,825, filed Aug. 18, 2009. Priority is claimed to all of the above-mentioned applications, and each application is hereby incorporated by reference.

BACKGROUND

1. Field of the Invention

The present invention relates generally to methods, software and systems for measuring and valuing the quality of information and data, where such measurements and values are made and processed by implementing objectively defined, measurable, comparable and repeatable dimensions using software and complex computers.

2. Description of Related Art

Commercial companies and government agencies purchase, generate, collect and acquire large amounts of data and information, both externally and internally, to run their businesses or governmental functions. This data is used to manufacture an information product, designed to improve the outcome of a process such as the decision to grant credit, issue an insurance policy or the course of treatment for a patient. The data and information relating to individuals upon which multi-million dollar business decisions rely have no data quality dimensions or metrics.

Today companies and government agencies invest significant amounts in data and information acquisition, generation, collection, aggregation, storage and use. The companies and government agencies make decisions, incur expenses, generate revenue, make policy, engage in activities and regulation all based on their data and information.

Business managers, decision makers and business intelligence modelers rely on automated systems to strain and sieve through oceans of data to find just the right combination of data elements from multiple sources to make decisions. The data elements they extract and use may be wrong or incomplete, or worse yet, the information may be correct but not timely or not have enough coverage from which to glean valuable decisions. Companies which use large amounts of data in their business processes do not presently know the absolute and relative value of their data assets or the economic life of such assets, as measured and scored by the implementation of data metrics. These same companies do not presently know how to best use their data assets.

Further, the present state of the industry for information and data quality is processing the information for entity resolution, i.e. identifying one individual by various names, cleansing, deduplication, integration and standardization of information elements. While these functions are appropriate, they do not include any form of data assurance management.

Surveys have revealed that data quality is considered important to business, and data should be treated as a strategic asset. These same companies, however, rarely have a data optimizing or data governance strategy. Further, there are no systems for predicting and systematically verifying data assurance in the marketplace. Instead, data quality software tool vendors typically focus on name de-duplication and standardization of addresses with USPS standards. Thus, currently every piece of this data or information is treated with equal weight and value, with no distinction among the data and its quality for such metrics as its accuracy, relevance, timeliness, completeness, coverage or provenance.

There exists a need for automated systems and methods for measuring and scoring dimensions of data to produce metrics. There also exists a need to understand and evaluate the true value of data in relation to business applications, to maximize potential and create and measure data value. These data assurance needs include: (1) the relative, compared contribution of data sources; (2) the absolute contribution of data sources to the data product being created; (3) the score or standardized measure of value of a data source in its application, data class or data use; (4) the optimization of data sources in the optimal order of functional use, such as the cascading of data sources in a priority-of-use order to obtain the best or optimal sequential use of the group of data sources; and (5) the determination of the intangible asset value of the data investment of a company.

SUMMARY

The problems presented are solved by the systems and methods of the illustrative embodiments described herein.

One purpose of this invention is to invoke data assurance management and metrics on data to provide better data or mixture of data for a particular business or governmental application such as income validation, asset validation, wealth indicators, model validation, refinance, credit line increase or decrease, managing risk, mortgage portfolio runoff, debtor early warning systems, more timely views of debtors than provided by credit bureaus, government assisted transactions, intelligence community decisioning, and anti-money laundering.

The invention is a process and method for management of data assets by computing data quality metrics on data sources to identify quality issues and measure the overall health of data based assets. Based on the values of these metrics, data may be transformed to improve the value of one or more metrics. One embodiment of the invention is for methods, systems and processes of data assurance management. Such methods, systems and processes comprise the steps of selecting a plurality of data elements as source inputs, conducting a statistical random sampling of the plurality of data elements, scoring the statistical random sampling, wherein said scoring yields data metrics, determining the frontier data points, utilizing the frontier data points to select an optimal data aggregation and integrating the optimal data aggregation into an output database. The data assurance management method can further include data metrics selected from the group consisting of: Accuracy, Redundancy/Uniqueness, Velocity, Acceleration, Completeness, Measure, Timeliness, Coverage, Consistency, Availability, Read Time, Write Time, Propagation Time and Precision. The data assurance management can further include multivariate optimization. Further, in the data assurance management, integrating the optimal data aggregation into an output database can further comprise determining if the data is entered into an integrated database, wherein the data is only entered if the data is unique or a rules engine selects the data as an optimal data set.

Other embodiments include data assurance management methods, systems and processes for data filtering, which include selecting a plurality of data elements as source inputs from a database, retrieving the data elements, and utilizing rules in a data hygiene engine to determine if the data is valid, wherein the data hygiene engine contains one or more of the following types of rules: field rules, row rules and groups-of-rows rules; wherein if the data is valid, the data is saved in a production database.

This application further includes numerous embodiments and variants all applicable to data assurance management methods, systems and processes.

Other objects, features, and advantages of the illustrative embodiments will become apparent with reference to the drawings and detailed description that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart describing a process or method for identifying optimal scores of the data dimension.

FIG. 2 is an example of a two dimensional plot in accordance with the invention.

FIG. 3 is an example of a two dimensional plot utilized to determine the frontier data points.

FIG. 4 is a flowchart describing the application of a rules engine to create integrated data.

FIG. 5 is a flowchart describing a process or method for data filtering to improve the overall data quality of a data source.

FIG. 6 is a spreadsheet illustrating the difference between a field, a group of fields within a row and a group of rows.

FIG. 7 is a spreadsheet illustrating the difference between a row and a group and showing a group is more than one row.

FIG. 8 is an example of a two dimensional plot utilized to compare frontier data points.

FIG. 9 is an example of a Hasse plot utilized to determine the frontier data points.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

In the following detailed description of the illustrative embodiments, reference is made to the accompanying drawings that form a part hereof. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is understood that other embodiments may be utilized and that logical, structural, mechanical, electrical, and chemical changes may be made without departing from the spirit or scope of the invention. To avoid detail not necessary to enable those skilled in the art to practice the embodiments described herein, the description may omit certain information known to those skilled in the art. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the illustrative embodiments is defined only by the appended claims.

The invention generally is directed to methods and systems for data assurance management. These methods and systems may be implemented in whole or in part by a computer while executing instructions from a data access module. A data access module includes executable instructions to modify a data source query (e.g., a Structured Query Language (SQL) query, a MultiDimensional eXpressions (MDX) query, a Data Mining Extensions (DMX) query and the like) to include specified filters. The data access module also includes executable instructions to apply the generated data source query to an underlying data source, which may form a portion of a computer or may be accessed as a separate networked machine through the network interface circuit.
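By way of non-limiting illustration, the following Python sketch shows one way a data access module might append specified filters to a SQL source query; the function name, parameters, and sample query are hypothetical and are not part of the claimed system.

    # Hypothetical sketch of a data access module step: append caller-specified
    # filters to a base SQL query, leaving value binding to the database driver.
    def apply_filters(base_query, filters):
        """Return (query, params) with one AND-ed equality condition per filter."""
        if not filters:
            return base_query, []
        clauses, params = [], []
        for column, value in filters.items():
            clauses.append(f"{column} = ?")
            params.append(value)
        keyword = "AND" if " where " in base_query.lower() else "WHERE"
        return f"{base_query} {keyword} {' AND '.join(clauses)}", params

    query, params = apply_filters("SELECT name, address FROM customers",
                                  {"state": "VA", "status": "active"})
    # query  -> "SELECT name, address FROM customers WHERE state = ? AND status = ?"
    # params -> ["VA", "active"]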

As will be appreciated by one of skill in the art, aspects of the present invention may be embodied as a method, data processing system, computer program product, or embedded system. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects. Furthermore, elements of the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in the medium. Any suitable computer readable medium may be utilized, including hard disks, CD-ROMs, optical storage devices, flash RAM, transmission media such as those supporting the Internet or an intranet, or magnetic storage devices.

Computer program code for carrying out operations of the present invention may be written in an object oriented programming language such as Java, Smalltalk, C++, C#, Visual Basic, a .NET enabled language, or in conventional procedural programming languages, such as the “C” programming language. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer.

The computer may also include a dataset database, which is a source of data that one of skill in the art desires to analyze for data assurance. The dataset database may form a portion of a computer or may be accessed as a separate networked machine through a network interface circuit.

It should be appreciated that any network described herein may include any system for exchanging data or performing steps of the invention, such as Internet, intranet, extranet, WAN, LAN, satellite communication, cellular phone communications, and the like. Further, the communications between entities concerning the transaction or access request can occur by any mechanism, including but not limited to, Internet, intranet, extranet, WAN, LAN, point of interaction device (point of sale device, personal digital assistant, cellular phone, kiosk, etc.), online communication, off line communication, and wireless connection. The present invention might further employ any number of conventional techniques for data transmission, signaling, data processing, network control, and the like. For example, radio frequency and other wireless techniques can be used in place of any network technique described herein.

The computer may have multiple input devices, output devices or combinations thereof. Each input/output device may permit the user to operate and interface with the computer system and control access, operation and control of the data assurance management system. The computer may also include the various well known components, such as one or more types of memory, graphical interface devices such as a monitor, handheld personal device, and the like. The graphical interface devices may not only present information from a computer physically connected therewith, but also web pages that may form part of the data assurance management system.

The computer system may communicate with a web server, application server, database server or the like, and may access other servers, such as other computer systems similar to the computer, via a network. The servers may communicate with one another by any communication means to exchange data or perform steps of the invention.

Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, server, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, server or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks, and may operate alone or in conjunction with additional hardware apparatus described herein.

Various embodiments of the present invention will now be described with reference to the figures in which like numbers correspond to like references throughout.

The invention has several applications. For instance, the invention can be used to acquire customers, including processes and systems for use in marketing, originations, mergers or acquisitions. The invention can also be used for portfolio management, for tasks such as customer management, collections and recovery, mortgage refinance, mortgage portfolio segmentation, early warning systems and enterprise risk management. Further, the invention can be used for financial protection, including the areas of fraud, regulatory compliance, privacy compliance and anti-money laundering.

The results and benefits of employing the invention for determining data quality are numerous. For instance, unneeded data sources can be deleted, or if more data is needed new sources can acquire it. Further, it can lead to negotiating more favorable data contracts and improve several parameters of information quality. Parameters of information may include one or more of the following: data accuracy, timeliness, completeness and relevancy. Another benefit to determining data quality is the ability to create and implement business risk mitigation programs. The invention can result in a report to help assure that the use of the data is in compliance with the laws and regulations of the United States, various States and foreign countries. The invention can be used to create an asset value for data and treat the data much like an investment portfolio with all types of dynamic principles or factors applied such as patterns, inertia, comparables, ratios, percentages, yields, risks, safety, tax considerations, liquidity, returns, derivatives, marketability, pricing, access and content. In sum, assuring and managing data quality by utilizing the invention can lead to an increase in return on investment (“ROI”).

Further, the data assurance management processes and methods disclosed herein are not limited to any industry or government function or agency, and may be used, for example, in the industries of Financial Services, Telecommunications, Publishing, Healthcare, and Retail and in the Federal Government functions and agencies for the intelligence community, Department of Justice, US Army, US Navy, Department of Homeland Security, Social Security, Medicare, and Environmental Protection Agency.

The results of the data assurance method are numerous, and may include one or more of the following:

1. Maximize the return on investment of each marketing campaign (right person, right message, timely, relevant, accurate, profitable)

2. Reduce bad debt

3. Improve collection of debt

4. Reduce the days outstanding (DSO)

5. Optimize payment terms & options

6. Improve effectiveness and efficiency of distribution channels

7. Improved customer profiles

8. Improved customer segmentation

9. Improved prospect profile

10. Improve prospect segmentation

11. Identify “best customer” attributes

12. Calculate the density of variables & elements

13. Perform needs and haves assessment

14. Consolidation of marketing databases and appending financial records

15. Assessment of duplicates in customer/internal databases (multi-product purchasers, multi-inquirers (tire kickers), conversion rates)

16. Data classification system for privacy & compliance

17. Data compilation methods

18. Source updates

19. File updates

20. Calculate degradation of file

21. Appending/overwriting

22. Append internal financial records to marketing databases to understand where cash and profits are derived

23. Contact preference website to ensure accurate data and improve customer care

24. Contact inactive persons or customers with communication preferences and to understand their interests

25. Append new data to records that have a high propensity to respond

26. From volume-based marketing to value-based direct marketing

27. Enriched content improves acquisition, performance, and cost effectiveness

28. Create standard ratings system

29. Present value calculations

30. Yield to maturity

31. Current yield

32. Data risk based on economic conditions

33. Calculation of speed

34. Price sensitivity (sensitivity increases at decreasing rate as maturity approaches)

35. Effective maturity

36. Duration

37. Score the attributes and value of a data source compared to other data sources for a particular information quality dimension-metric

38. Score the attributes and value of a group of data sources compared to other groups of data sources for a series of information quality dimensions-metrics

39. Determination and calculation of value of data assets

40. Determination and calculation of amortization of value of data assets

41. Optimize external data purchases

42. Optimize internal data accumulation and creation

43. Optimize data maintenance costs

44. Evaluate impact of loss or interruption of data sources

45. Determine alternative data sources in the event of loss or interruption of data sources

46. Comparison of data sources, internal and external, upon acquisition or merger of companies

47. Reduction of data costs

48. Increase value of data assets

49. Create volatility metrics for data assets

50. Synchronize contract due dates (enterprise)

51. Perform Cost/Benefit analysis

52. Create & apply return on data assets calculation

53. Create unique customer insights

54. Enrich customer data with internal financial performance data

55. Suppress external data based on internal financial performance data.

56. Incubate efficient e-business

57. Deploy best-of-breed interfaces

58. Enhance efficient e-services

59. Transform integrated, confusing data sources to usefulness and

60. Iterate the synergies of multiple Information Quality Dimensions.

Information quality metrics are computable values applied to information data structures. Generally, the metrics represent objective, computable, and comparable values used to assess the quality of the underlying information. Computable values are represented as mathematical formulae that may be applied generally to information data. These formulae assume that information is represented as a data sequence (i.e. relational database, sequence of objects, etc.).

Metrics envisioned by the inventors should be objective, measurable, comparable, and repeatable. Metrics having these qualities are good candidates for information quality metrics as these properties allow for values to be reliably compared at different times. The ability to compare metric values over time provides the ability to measure improvement and to correlate improvement to other business metrics. This invention concerns metrics that have a general applicability. General metrics may be computed for a large number of disparate systems. These metrics may be employed to compare information systems with different data, different purposes, and with different architectures. The metrics module includes executable instructions for various processes, discussed below.

To assist in understanding the invention, the following definitions are provided.

DEFINITIONS

Objective—Metrics should be objective, not subjective. Subjective metrics are prone to inconsistent interpretation or manipulation by the measurer, making them less reliable. For example, a questionnaire asking for an opinion (i.e. “Do you like our Information Quality on a 1-5 scale with 1 lowest and 5 highest?”) can yield different results depending on how the question is worded. The resulting metric is not objective and may lead to inconsistent results. Similar questions may lead to different results, undermining the utility of the measurement. This should not be understood to mean that all questionnaires lead to subjective metrics. A question is objective so long as it is reasonably expected that the majority of responses interpret the question in the same way.

Measurable—Metrics should employ definitive units of measure to provide a quantifiable value that is available for analysis. Non-measurable metrics may have a theoretical or conceptual value, but have little utility. Immeasurable metrics may be converted to measurable metrics. For example, an immeasurable quantity may be transformed to a measurable quantity when an adequate definition is provided. Second, some immeasurable quantities may provide business value, but they should not be considered metrics. Metrics, by definition, must be measurable.

Comparable—The value of a metric should be comparable to other values. Comparable means the units of measure are the same or convertible to one another such that the sum of more than one metric provides a meaningful result. For example, the cost of two metrics is comparable because one can compare the cost values and determine that one is higher than the other. Similarly, quantity is comparable because one can compare two different quantities and determine one is greater than the other. However, a metric for cost and a metric for quantity may not be comparable. The metrics have different units of measure (cost in dollars, quantity in bytes) and their sum does not provide a meaningful result. Two such measurements may not be comparable in a meaningful way.

Repeatable—Repeating the same measurement under similar conditions provides similar results. If a measurement is not repeatable, its value is unreliable. A measurement may be repeatable, however, even if its precise original measurement cannot be duplicated. For example, a database may be under constant updates, with data added and deleted quickly. The count of the number of rows at one instant is a repeatable measurement even though its measurement cannot be duplicated. Instead, the measurement is repeatable because if we measured again under similar conditions we would get similar results.

Set—A data set is a collection M into a whole of definite, distinct objects m. Any collection of distinct items may be a set; however, the members of the collection must be distinct.

Sequence—A sequence is an ordered list of objects. The members of a sequence are not necessarily distinct. Data is often formulated as a sequence, with a first element, second, third, etc. Data that is not ordered may be made into a sequence by assigning a unique integer to each data element. In this way an unordered bag of data can be made into an ordered data sequence.

Cardinality |·|—The number of elements in a sequence. Let S be a data sequence. The cardinality (number of elements in the sequence) is represented as |S|. Generally, the cardinality of a sequence may be finite, infinite, or transfinite. However, data sequences encountered in actual applications are always finite (it would require an infinite amount of storage space to hold an infinite data sequence).

Parallel Sequences (S|T)—Let S and T be data sequences. The sequences S and T are called parallel if and only if |S| = |T|, and this is represented as (S|T). Parallel sequences are common in relational databases. For example, a data table with two fields, First Name and Last Name, would have each column of data as a valid data sequence. Each of these sequences has the same number of rows, making these parallel sequences. More importantly, each row of data is often taken together as a unit. Parallel sequences aid in formalizing this concept and identifying sequence elements that belong together in the same unit.

Unique Sequence—A unique sequence is a sequence where every term is different from all other terms. The elements of a unique sequence form a set, and the number of elements in the set is the same as the number of elements in the sequence.

Oracle—An all-knowing machine that is able to determine the true value of some quantity, and can be used to define metrics that would be otherwise immeasurable. For purposes here, an oracle is a highly trusted source of information.

Data Elements—A basic unit of information having a unique meaning and distinct values. The characteristics of a data element are: 1) a name of the element, 2) a clear definition and 3) an enumerated value. Examples of data elements include, but are not limited to, a person or entity name, address, phone number; a person's income, sex, marital status; proprietary medicinal product name; active ingredient; pharmaceutical dose form; strength of the active ingredient; route of administration; or units such as parts per million (PPM) or parts per billion (PPB).

Metrics module—The metrics module is a module of executable instructions received from one of skill in the art to determine which metrics will be applied to data, how the data will be organized, if at all, and the presentation of the metrics results. The metrics module includes executable instructions for calculating the metrics.

Metrics—A measure of a data quality dimension, where a data quality dimension is a characteristic of data. Metrics envisioned by the inventors should be objective, measurable, comparable, and repeatable. Metrics having these qualities are good candidates for information quality metrics as these properties allow for values to be reliably compared at different times. Application of metrics yields a value (score) for each data analyzed. The score may be numerical, such as a positive or negative number, or a symbol. More importantly, the score must be such that the score of each data in a plurality of samples may be compared to one another. The terms “metrics,” “data quality metrics” and “dimensions” may be used interchangeably.

Below are several examples of dimensions and how they are scored to yield metrics.

I) Accuracy. Accuracy measures how close the test data sequence S is to the ‘truth’ set. The truth set must be obtained from external means and cannot be derived from S. Accuracy is used to quantify the correctness of the information at hand. In any large information system there are bound to be incorrect data elements. Incorrect data may arise from many sources including imperfect data entry or collection, transmission errors in electronic systems, data that has changed but not been updated, or any number of other causes of inaccurate data values. Data that is incorrect often has limited use. A list of customer names and addresses has little value if the data is inaccurate. Measuring the accuracy of the data is useful in evaluating how valuable the data in the information system is. To determine accuracy, let τ:S→[0,1] be an oracle such that τ maps the elements of the sequence s_i ∈ S to the value 1 if and only if the value of s_i is correct and 0 otherwise. The set S is often produced through some measurement or data entry process. These processes are prone to errors. The truth function τ indicates whether a given sequence element is correct.

The accuracy A metric is defined as

$\mspace{20mu} {A = {\frac{1}{\max ( {{S},{\tau (S)}} )}\text{?}}}$?indicates text missing or illegible when filed

II) Redundancy/Uniqueness. Redundancy and Uniqueness are related metrics. Redundancy measures the amount of duplicate data in an information system, while Uniqueness measures the amount of distinct information. Redundant data is often encountered in large information systems, as well as when combining disparate information systems. Understanding the data Uniqueness within an information system allows one to compute how much useful data is present. For example, a list of 10,000 customer names and addresses is not very useful if all of the records are identical. In addition, Redundancy is very useful when combining information systems. For example, when businesses are combined during a merger or acquisition, there is often a need to combine information systems. In many cases, the value of the merger is in part based on the value of combining their information systems. In these cases it is useful to know how much data is redundant between these systems in order to quantify the value of the merger of the systems. The redundancy and uniqueness metrics are calculated as follows: let S be a data sequence and let Ŝ be the set whose elements are the elements of S (that is, the distinct elements of S).

The Redundancy and Uniqueness are

$R = 1 - \frac{|\hat{S}|}{|S|} \qquad \text{and} \qquad U = \frac{|\hat{S}|}{|S|},$

where both Redundancy and Uniqueness are on the range [0,1].
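By way of non-limiting illustration, a short Python sketch of the Uniqueness and Redundancy calculations on a hypothetical list of customer records.

    # Illustrative sketch only: uniqueness as |set of elements| / |sequence|,
    # redundancy as its complement.
    def uniqueness(sequence):
        return len(set(sequence)) / len(sequence) if sequence else 0.0

    def redundancy(sequence):
        return 1.0 - uniqueness(sequence)

    records = ["jane smith", "j smith", "jane smith", "bob jones"]
    # uniqueness(records) == 0.75 and redundancy(records) == 0.25, since one of
    # the four entries duplicates another.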

III) Velocity—Velocity measures the rate of change of data over time. This is important because data is often dynamic and changes over time. Understanding the Velocity of the data assists in estimating how accurate the data is at any given time. For example, if a list of customer names and addresses was developed two years ago, and 1% of people move every year, then 1-2% of the data is no longer accurate. Alternatively, once velocity is measured, collection methods can be designed to minimize the impact of inaccurate data. For example, if one tolerates 5% inaccurate data and the data changes 1% per month, then data needs to be recollected every five months in order to satisfy the tolerance bounds. Collecting more frequently may result in unnecessary costs. The velocity metric is calculated as follows: let S(t) be a data sequence at time t and let Δt be the length of time between observations. Let v be a map such that v(s_i(t), s_i(t+Δt)) = 1 if s_i(t) ≠ s_i(t+Δt) and 0 otherwise. The Velocity V is then

$ \mspace{20mu} {v - {\frac{1}{\Delta \; t}\text{?}{s_{i}( {t + {\Delta \; t}} )}}} ),{\text{?}\text{indicates text missing or illegible when filed}}$

where Velocity is measured on the range (−∞,∞) and counts the number of fields changed per unit time.

IV) Acceleration—Acceleration is the rate of change of velocity over time. If the information velocity is changing over time, the data may become inaccurate more or less quickly than indicated by the velocity. Measuring the acceleration helps determine how rapidly the velocity is changing, and provides confidence in the utilization of velocity. When acceleration is near zero, the velocity is near constant. When the acceleration is not near zero, the velocity is changing and measuring velocity alone is insufficient. In these circumstances, accounting for the acceleration will provide better estimates of the length of time required between data collections. The acceleration metric is calculated as follows: let v(t) be the velocity measured at time t. The Acceleration is

$\mspace{20mu} {{\text{?} = \frac{{v( {t + {\Delta \; t}} )} - {v(t)}}{\Delta \; t}},{\text{?}\text{indicates text missing or illegible when filed}}}$

and Acceleration is measured on the range (−∞,∞).
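By way of non-limiting illustration, a Python sketch of the Velocity and Acceleration calculations above, assuming two or three snapshots of the same parallel data sequence taken Δt apart; the address values are hypothetical.

    # Illustrative sketch only: velocity as fields changed per unit time between
    # two snapshots, and acceleration as the change in velocity between intervals.
    def velocity(snapshot_t, snapshot_t_dt, delta_t):
        """Both snapshots are parallel sequences (same length, same order)."""
        changed = sum(1 for a, b in zip(snapshot_t, snapshot_t_dt) if a != b)
        return changed / delta_t

    def acceleration(v_t, v_t_dt, delta_t):
        return (v_t_dt - v_t) / delta_t

    jan = ["101 Main St", "7 Oak Ave", "22 Pine Rd"]
    feb = ["101 Main St", "9 Elm Rd", "22 Pine Rd"]
    mar = ["400 Lake Dr", "9 Elm Rd", "5 Hill Ct"]
    v1 = velocity(jan, feb, 1.0)       # 1 field changed per month
    v2 = velocity(feb, mar, 1.0)       # 2 fields changed per month
    a = acceleration(v1, v2, 1.0)      # velocity grew by 1 change/month per month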

V) Completeness—Completeness may be used to measure how many rows of data in a database have all required fields present. Theoretically, if a field is required it must be present in each row. On the other hand, a database may have fields that are not required in the data table and are thus not present in every row. Once the completeness of a database is calculated, it can be compared with the completeness of other databases. The completeness metric is measured by asking how many of the elements of the test data sequence S are present versus how many are left null (blank/no entry). In other words, let ρ:S→[0,1] be a map such that ρ takes the value 1 if and only if s_i ∈ S is not null and 0 otherwise. Completeness is defined on the range [0,1]. The completeness C_P for a set of parallel sequences S₁, S₂, . . . , S_n is defined as

$\mspace{20mu} {C_{p} = {\frac{1}{n{S}}{\text{?}.\text{?}}\text{indicates text missing or illegible when filed}}}$

VI) Amount of Data—Measures the relative amount of data present. This measurement allows one to evaluate how much data is present relative to how much is needed. This aspect of information quality helps determine if the information system is capable of supporting the present or future business needs. Some situations require a certain amount of data present. For example, in a marketing campaign, if 5% of people respond to a campaign and become customers and one desires to get 1,000 new customers, then the campaign must target at least 20,000 people. The amount metric is defined as follows: let P be the number of data units provided and n be the number of data units needed. The amount of data metric D is

$D = \frac{P}{n},$

where the amount of Data is on the range [0,∞]. When D<1 there is always less data than needed. However, when D>1 there are more data units than needed, but this does not necessarily mean all data needed is present because the data may be redundant, or the uniqueness may be too low.

VII) Timeliness—Timeliness examines the utility of data based on the age of the data. Data is often a measurement over some period of time and is valid for some period after. Over time, the utility of the data decreases as the true values will change while the measured data does not. For example, if it takes two days to receive data, and the data is only valid for one second after it is received, the timeliness is low. However, if the data were valid for 200 days after delivery, the timeliness would be much higher. To determine the timeliness metric, let f be the expectation of the amount of time required to fulfill a data request and let v be the length of time the data is valid after delivery. The timeliness T is given by

$T = {\frac{f}{v}.}$

Timeliness is measured on the range (−∞,∞). Negative values indicate that the data is invalid when received.

VIII) Coverage—Coverage measures the amount of data present in relation to all data. Data is often a measurement of some type. For example, if given a list of names and addresses of people in the United States, the coverage would be the ratio between all of the accurate data present to the total number of U.S. households. Often an oracle is used to provide this information since it may not be known how much data is actually present, i.e., it is unknown exactly how many U.S. households are in existence at any given time. To find the coverage metric, let π:S→N be an oracle that provides the length of the complete data sequence. Let τ:S→[0,1] be an oracle such that τ maps the elements of the sequence s_i ∈ S to the value 1 if and only if the value of s_i is correct and 0 otherwise. The coverage C_V is

$\mspace{20mu} {C_{V} = {\frac{1}{\pi {S}}{\text{?}.\text{?}}\text{indicates text missing or illegible when filed}}}$

The coverage measures the amount of correct data in S in relation to the total amount of data in the true data sequence. Coverage is on the range [0,1].
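By way of non-limiting illustration, a Python sketch of the coverage calculation, where both oracles (the correctness test and the size of the complete data sequence) are stand-ins supplied by the caller.

    # Illustrative sketch only: coverage as correct elements of S divided by the
    # size of the complete (true) data sequence reported by the oracle pi.
    def coverage(sequence, tau, population_size):
        """tau(element) -> 1 if correct, 0 otherwise; population_size = pi(S)."""
        if population_size <= 0:
            return 0.0
        return sum(tau(s) for s in sequence) / population_size

    addresses = ["101 Main St", "7 Oak Ave", "bad record"]
    score = coverage(addresses, lambda a: 0 if a == "bad record" else 1, 10)
    # score == 0.2: two correct records out of ten in the complete sequence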

IX) Consistency—Consistency measures the number of rule failures in a data sequence as a proportion of all rule evaluations. Rules are often applied to data sequences. Some rules apply strictly to individual sequence elements (e.g., R: s_i < 4 for all s_i ∈ S) or may be defined across multiple sequences (e.g., R: s_i + t_i = 1 for all s_i ∈ S and t_i ∈ T, with S|T). Consistency allows one to measure the application of business rules to an information system. Once the business rule is defined, data in the information system is examined to determine how often the rule is satisfied. For example, if the business rule is that all customers must provide a valid telephone number when placing an order, the phone numbers for the existing orders can be examined to determine how often an order is associated with a valid telephone number. The consistency metric is determined as follows: given a rule R, each application of R is evaluated to determine whether the rule is satisfied (consistent) or is violated (inconsistent). Let T be a sequence of applications of R. Let χ:T→[0,1] be a map such that χ takes the value 1 if the application t_i ∈ T is consistent and 0 otherwise. The consistency C_S is given by

$\mspace{20mu} {C_{S} = {\frac{1}{R}\text{?}}}$?indicates text missing or illegible when filed

X) Availability—Availability measures how often a data sequence is available for use. Databases may be unavailable at times for maintenance, failure, security breaches, etc. Availability measures the proportion of time a data sequence is available. Measuring the availability of the information source over a period of time provides insight into the overall data quality that can be expected. To calculate the availability metric, let S be a data sequence. During some finite time t, let A be the amount of time S was available and U be the amount of time S was not available, so that A+U=t. The availability is

${A_{V} = {\frac{A}{A + U} = \frac{A}{t}}},$

where the availability is measured on the range [0,1].

XI) Read Time—The Read Time measures how quickly data may be accessed upon request. When a user requests access to a data sequence, there is a finite time required to gather the information and provide it to the user. For a typical database system, this is simply the time required to access stored information. Other data sources, such as questionnaires, may require elaborate means to access current data. The Read Time is a measure that can be used to estimate the expected time required to get information from a source. The read time metric is the expectation of the time required to fulfill a data request from S, where the read time is measured on the range [0,∞). The read time does not include the time required to analyze the data and gather it into a useful form.

XII) Write Time—The Write Time measures how quickly an update to a data sequence is available for use, i.e., the time required to store information in an information system. When a user requests to update a data sequence, there is a finite time required to change the data and make the change available to others. The Write Time measures this delay. The write time includes any time required to process or analyze the information to put it into a useful form. Thus, the write time metric is the expectation of the time required to update a data sequence, where the write time is measured on the range [0,∞).

XIII) Propagation Time—The Propagation Time measures how quickly an update to a data sequence may be used by combining the read time with the write time. The propagation time is useful in understanding the overall processing time to gather, analyze, and access new information. To find the propagation time metric, let w be the write time for a data sequence S and let r be the read time on S. The propagation time is T_p = w + r, where the propagation time is measured on the range [0,∞).

XIV) Relevancy—Relevancy is a measure of the dispersion of data about a central point in relation to the true value of the data. The predicted value, as presented in the data, is compared to the true value. A tolerance is set, measured either as a percent range against the true value or predicted value, or as an absolute range against the true value or predicted value. When the true value (predicted value) falls within the tolerance, the predicted value is determined to be relevant. When the true value (predicted value) falls outside the tolerance, the predicted value is determined to be not relevant. The relevancy metric is the ratio of the number of predicted values that are relevant to the total number of predicted values. Items are only counted when a true-predicted value pair is present.
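By way of non-limiting illustration, a Python sketch of the relevancy calculation using a percent tolerance; the income figures are hypothetical, and pairs missing either value are skipped, mirroring the requirement that only true-predicted pairs are counted.

    # Illustrative sketch only: relevancy as the fraction of predicted values
    # falling within a tolerance of their paired true values.
    def relevancy(pairs, tolerance, relative=True):
        """pairs: iterable of (true_value, predicted_value); tolerance: allowed gap."""
        counted = relevant = 0
        for true_value, predicted in pairs:
            if true_value is None or predicted is None:
                continue
            counted += 1
            bound = abs(true_value) * tolerance if relative else tolerance
            if abs(true_value - predicted) <= bound:
                relevant += 1
        return relevant / counted if counted else 0.0

    incomes = [(50000, 52000), (80000, 95000), (None, 61000)]
    # relevancy(incomes, 0.05) == 0.5: one of the two counted pairs is within 5%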

Other metrics and dimensions may include those set out in Binary Classifier Book: Binary and Multiclass Classifiers, ISBN 1615800131, © 2010 by Brian Kolo, the entirety of which is incorporated by reference.

EMBODIMENTS OF THE INVENTION

FIG. 1 is an example of a method for data assurance management in accordance with an embodiment of the present invention. Different functions, operations or events of the method of FIG. 1 may be performed by or involve different entities, individuals, systems or the like. This embodiment occurs in whole or in part by one or more computers.

FIG. 1 describes a method for identifying optimal data combinations and discovering optimal integration rules. In a first step 101, a plurality of sources of data elements is selected as source inputs to the method. The data may all reside in a single database; however, it is understood that certain elements of the data arise from a particular source, and there exist multiple sources for the data. The exact data elements selected depend upon the data owner or licensee needs or requirements. The data elements can be in any form, such as, for example, raw un-integrated data. The data can also be the result of other data integration. The data sources are selected because it is desired to integrate the data elements of the plurality of data sources with one another. The selection of the exact data sources (Data 1 and Data 2) may occur by any means, including by person or computer.

In one alternative embodiment, the uses or needs of internal and external data are classified separately into categories of marketing, risk, financial and operations. The invention also includes combining categories or functions identified by the International Society of Automation (ISA) for the purpose of identifying other uses and needs for the data. The determination of the uses and needs of the data may also be influenced by laws and regulations of the United States, constituent States and foreign countries.

A statistical random sampling of the data elements is taken from the plurality of data sources in step 102, resulting in a plurality of samples (Sample 1 from Data 1, Sample 2 from Data 2, etc.). The statistical sample can be taken by any means or method, and is application dependent. The sample is a portion of the population selected to represent an entire population, or a subpopulation of interest. A simple random sample is a sample of size n from a population of size N selected such that each possible sample of size n has the same probability of being selected. The population is the collection of all records of interest. Random sampling methods are generally known, and include associating a random number with the records of the table, so that a particular number of the records are selected randomly. Ideally, the statistical sample would be data elements which correspond to data elements from the other data sources. For example, the data elements selected from the plurality of data sources would all be a person or entity name, address, phone number; a person's income, sex, marital status; proprietary medicinal product name; active ingredient; pharmaceutical dose form; strength of the active ingredient; route of administration; units of measure, and the like. The data elements do not have to be identical. For example, one data source may have the data element of a name such as Jane Smith and another data source may have the same name, Jane Smith, or a variant such as J. Smith. In one alternative embodiment of this invention, the process does not include step 102.
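By way of non-limiting illustration, a Python sketch of drawing a simple random sample from one data source; the record contents and sample size are hypothetical.

    # Illustrative sketch only: a simple random sample of n records, each record
    # having an equal chance of selection.
    import random

    def simple_random_sample(records, n, seed=None):
        rng = random.Random(seed)
        return rng.sample(records, min(n, len(records)))

    data_1 = [{"name": "Jane Smith", "phone": "555-0100"},
              {"name": "J. Smith", "phone": "555-0100"},
              {"name": "Bob Jones", "phone": "555-0187"}]
    sample_1 = simple_random_sample(data_1, 2, seed=42)   # Sample 1 from Data 1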

The plurality of samples from the statistical random sampling of Step 102 is then scored in Step 103. Scoring involves taking the Step 102 plurality of samples (Sample 1 and Sample 2) and calculating the data quality metrics. Metrics can be scored by any means contemplated by this specification, including Accuracy, Redundancy/Uniqueness, Velocity, Acceleration, Completeness, Measure, Timeliness, Coverage, Consistency, Availability, Read Time, Write Time, Propagation Time and the like.

The metrics applied depend upon the uses and needs of the data, the objectives for the data, and the data priority. It is envisioned that in one embodiment, one of skill in the art pre-determines which metrics modules will be applied, and instructs the metrics module accordingly. The metrics module is a module of executable instructions received from one of skill in the art to determine which metrics module process will be applied to the plurality of samples, how the plurality of samples will be organized, if at all, and the presentation of the results. The metrics module includes executable instructions for the metrics. The metrics module efficiently filters the plurality of samples from the database and organizes the plurality of samples based upon the executable instructions.

The plurality of samples of step 102 is retrieved upon instructions from the metrics module, and optionally stored in a temporary spreadsheet or database. The metrics module may optionally then reorder the plurality of samples in an ascending order of number of distinct attributes, or based upon information lacking or present in the dataset. In another embodiment, the plurality of samples is ordered by descending merit. In one embodiment, if multiple categories have the same number of distinct attributes, they are grouped in their original order. The categories need not be physically reordered. Alternatively, it is determined whether it is more efficient to physically reorder the categories or number them so they appear to be reordered in interactions with the data access module. This is an optional operation; no operation in the process is dependent on this operation.

Once this restructuring is complete, the metrics module applies a first metric to the plurality of samples, and applies a value (score) to each data in the plurality of samples. The score may be numerical, such as a positive or negative number, or a symbol. More importantly, the score must be such that the score of each data in the plurality of samples may be compared to one another. Thus, the plurality of samples must have the same units. The score for each data is associated with the data in a spreadsheet or database. The metrics module then applies a second metric to the plurality of samples, and applies a second value (score) to each data in the plurality of samples. The second value (score) is associated with the data in the spreadsheet or database. This process is repeated for each metric selected by the metrics module.

In Step 104, once the scores for the plurality of samples are calculated, the resulting data sets must be combined and multivariate optimization is performed. The data can be combined in every possible way, as each combination/data set will yield different results. Each sequential combination yields a unique data set different from other sequential combinations, also termed a permutation. For instance, combining Data 1 and Data 2 will yield four sets of data: Data 1 by itself, Data 2 by itself, Data 1+Data 2 and Data 2+Data 1. Combining three data sets (Data 1, Data 2 and Data 3) would yield: Data 1, Data 2, Data 3, Data 1+Data 2+Data 3, Data 1+Data 3+Data 2, Data 2+Data 3+Data 1, etc., until all data sets are combined such that

$\sum\limits_{k = 1}^{N}\; {\frac{N!}{( {N - k} )!}.}$
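By way of non-limiting illustration, a Python sketch that enumerates every ordered, non-empty selection of data sources, which is the quantity summed above; for two sources it reproduces the four data sets in the example.

    # Illustrative sketch only: enumerate ordered selections (permutations) of
    # 1..N data sources, giving the sum over k of N!/(N-k)! data sets.
    from itertools import permutations

    def ordered_combinations(sources):
        for k in range(1, len(sources) + 1):
            yield from permutations(sources, k)

    sets = list(ordered_combinations(["Data 1", "Data 2"]))
    # [('Data 1',), ('Data 2',), ('Data 1', 'Data 2'), ('Data 2', 'Data 1')]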

For each combined data set, the metrics are then plotted against each other. Thus, two metrics yield a two dimensional plot, three metrics yield a three dimensional plot, etc. The upper bound is limited only by the number of distinct metrics and computational power.

FIG. 2 is an example of a two dimensional plot in accordance with the invention. The axes are oriented so that ‘better’ values are to the right along the x-axis and up along the y-axis. Relative to the point in question, the shaded area to the right and up contains points that are considered definitively better, as all of these points have greater values for both x and y. The region to the left and below is considered definitively worse because all of these points have lesser values for both x and y. The regions marked ‘Unknown’ are not comparable to the point in question. Points in this region are better along one axis while worse on the other. Alternatively, the plot may be oriented so the ‘worse’ values are to the right along the x-axis and up along the y-axis. The axis orientation is designed so that one quadrant has ‘better’ values for both the x- and y-axis, one quadrant has ‘worse’ values for both the x- and y-axis, and two quadrants are ‘Unknown.’

Notably, the value of measure for one metric is distinct from the other metric such that the two different metric values cannot be summed, i.e., there are multiple independent variables. In general, it is not possible to determine a unique optimum when considering multiple independent variables. The inventors, however, are credited with finding a method of determining that certain data sources and/or certain aggregations are less desirable than others.

The plotted values are then compared to each other and the inferior values are eliminated. For example, referring to FIG. 3, point B is greater than point A because B has a higher value for both the x- and y-axis. Point A is therefore eliminated. Similarly, point C is less than point A because C is less on both the x- and y-axis. Point C is therefore eliminated. However, point D is not comparable to A. Point D has a higher value on the y-axis, but a lower value on the x-axis. Because D is neither greater than nor less than A, points A and D are not comparable.

Multivariate optimization resulting from combining data sources leads to a poset with a frontier of incomparable elements. This approach examines all possible aggregations, and then eliminates those that are definitively less desirable. The remaining aggregations form a frontier of non-comparable optima. Thus, the frontier is the set of data combinations that are individually incomparable, but are definitively superior to all other combinations, and the desired solution must be one of these aggregations.
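By way of non-limiting illustration, a Python sketch of eliminating dominated aggregations to leave the frontier of incomparable optima, assuming higher is better on every metric axis; the aggregation names and scores echo points A through D of FIG. 3 but are hypothetical values.

    # Illustrative sketch only: keep the aggregations that are not dominated on
    # every metric by some other aggregation.
    def pareto_frontier(points):
        """points: aggregation name -> tuple of metric scores (higher is better)."""
        def dominates(p, q):
            return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))
        return {name: scores for name, scores in points.items()
                if not any(dominates(other, scores)
                           for other_name, other in points.items() if other_name != name)}

    scores = {"A": (0.6, 0.6), "B": (0.8, 0.7), "C": (0.4, 0.5), "D": (0.5, 0.9)}
    # pareto_frontier(scores) keeps B and D; A and C are dominated by B.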

By combining two metrics together, a new type of entity in the form of a poset is created. The members of the poset may be comparable to each other similar to metrics, or they may be incomparable to each other. Analysis of these incomparable poset members results in a Pareto frontier.

The Pareto frontier may be used to identify the optimal combination of data sources. The points on the frontier are reviewed to determine the tradeoff between each pair of points. Points on the frontier are not comparable to each other; thus each pair of points may exhibit a tradeoff where moving between points improves the value along at least one axis, while making the value along at least one other axis worse.

Once the frontier pairs have tradeoffs specified, an external source may be used to select the preferred point on the frontier. The external source may be a person, group of people, software application, or other automated means of selecting an optimum among a group of incomparable frontier points. In this manner an optimal data combination is selected because each point is associated with a particular data combination (and permutation within that combination).

In one alternate embodiment, multivariate optimization is not performed in step 104. Thus, step 104 would not address how to combine rows of data. Instead, the multiple data sets are reviewed to find which sets are integrated together or are chosen. Thus, the best combination is determined, but no attempt is made to determine the best way of putting the combination together.

In step 105, the particular aggregation chosen is determined by a number of factors. First, the aggregation selected can be determined by customer preferences for a particular situation. Second, the tradeoff of each remaining data point relative to another can be determined. Either computer software or a person must determine which metric along the axis is more important. For example, for Point D, metric Y is higher than Point A but metric X is lower than Point A. Finally, linear regression may be used to develop a cost function based on empirical data from the customer.

The selection of step 105 determines the rules (step 106), which determine the optimal data selection, integration and rank ordering. Notably, the chosen aggregation, or final, selected data set from Step 105, has more than one field. Each data field has a range, i.e., a minimum and maximum. The range is divided into segments. Segment combinations are created through a Cartesian product of all segmented fields. Each segment combination is analyzed to find how the data sets within that segment are rank ordered, and the best, i.e., best-valued, data set is selected. There is at least one set of rules for every single field.

In a given analysis, each data row is uniquely associated with a segment combination. A segment combination is analyzed by examining each data row associated with the segment combination, and comparing a particular field value within the row to known correct data. A count is maintained of the number of data rows that are correct and incorrect, for a specific field, and for a given segment combination for each data source. A frequency is computed for each segment combination and each data source by dividing the correct count by the sum of the correct and incorrect counts within a particular segment combination and for a particular data source.
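By way of non-limiting illustration, a Python sketch of the correct/incorrect counting and frequency calculation per segment combination and data source; the segment labels, field, and truth lookup are hypothetical.

    # Illustrative sketch only: count correct vs. incorrect values of one field
    # per (segment combination, data source), then convert counts to frequencies.
    from collections import defaultdict

    def segment_frequencies(rows_by_source, truth, field):
        """rows_by_source: source name -> list of rows (dicts with 'key',
        'segment', and the field); truth: key -> known correct field value."""
        counts = defaultdict(lambda: [0, 0])   # (segment, source) -> [correct, incorrect]
        for source, rows in rows_by_source.items():
            for row in rows:
                correct = truth.get(row["key"]) == row.get(field)
                counts[(row["segment"], source)][0 if correct else 1] += 1
        return {key: c / (c + i) for key, (c, i) in counts.items() if c + i}

    truth = {"jane": "VA", "bob": "MD"}
    rows = {"Data 1": [{"key": "jane", "segment": "income:low", "state": "VA"},
                       {"key": "bob", "segment": "income:low", "state": "TX"}],
            "Data 2": [{"key": "jane", "segment": "income:low", "state": "VA"}]}
    freq = segment_frequencies(rows, truth, "state")
    # {("income:low", "Data 1"): 0.5, ("income:low", "Data 2"): 1.0}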

This process is repeated for all segment combinations and data fields in the Cartesian product (or the segment combinations of interest). For a given data field, each segment combination is reviewed and a rank order of best performing data source for that field is selected. The rank order of best performing may be an ordering of the data sources from the highest frequency to lowest frequency, or may be determined with other analysis. For example, we may compute the statistical error for the frequency value, and rank order the data sources according to the ratio of frequency value to the statistical variance. Alternatively, other statistical analysis may be included to identify the rank ordering of best performing data source for a particular segment combination.
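A minimal sketch of the per-segment frequency and rank-ordering computation follows. The segment definition (age decades and income bands), the field names, and the "known correct" reference data are illustrative assumptions; a real analysis would use the customer's own segmentation and reference data.

    from collections import defaultdict

    def segment_of(row):
        """Map a row to a segment combination (illustrative buckets)."""
        return (row["age"] // 10, row["income"] // 25000)

    def frequencies(sample_rows, reference, field):
        """sample_rows: iterable of (source, row) pairs; reference: {row id: known
        correct value of `field`}. Returns {segment: {source: frequency}}."""
        correct = defaultdict(lambda: defaultdict(int))
        total = defaultdict(lambda: defaultdict(int))
        for source, row in sample_rows:
            seg = segment_of(row)
            total[seg][source] += 1
            if row[field] == reference[row["id"]]:
                correct[seg][source] += 1
        return {seg: {src: correct[seg][src] / total[seg][src] for src in total[seg]}
                for seg in total}

    def rank_order(freqs_for_segment):
        """Order sources from highest to lowest frequency for one segment."""
        return sorted(freqs_for_segment, key=freqs_for_segment.get, reverse=True)

The same structure could be extended to rank by the ratio of frequency to statistical variance, as described above, by changing the sort key.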

This is repeated for each segment combination of interest. The result is a matrix identifying a rank ordering of preferred data source for each segment combination for the particular data field. The entire process may be repeated for other data fields.

Once the rank order of best performing preferred data source is identified for each segment combination for a particular field, data integration may begin. Here, the entire data set (not the sample set) is reviewed. We identify the unique rows across all data sources. For each unique row of data (if a row represents an entity, a unique row is a unique entity, and the same entity may be present in multiple data sources), we identify the unique segment combination associated with it. We determine which data sources have field information present for this row. We select the value of the field from the highest rank order of best performing data sources, and incorporate this into the integrated data. This process may be repeated for every row in the original data.

We may repeat the data integration process for each field of interest. The result is an integrated data source where every unique row appears once, and every field has a single value. Within a single row, the field values present may arise from distinct data sources.
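A minimal sketch of the field-level integration step is shown below. The rank_order_by_segment structure is assumed to come from a sampling analysis such as the one sketched earlier; the data layout (one dictionary of rows per source, keyed by entity) is an illustrative assumption.

    def integrate_field(entities, sources, field, rank_order_by_segment, segment_of):
        """entities: iterable of entity keys; sources: {source_name: {entity: row}};
        rank_order_by_segment: {segment: [source_name, ...] best first}.
        Returns {entity: chosen field value}."""
        integrated = {}
        for entity in entities:
            # Any source's row for this entity can determine its segment combination.
            any_row = next(src[entity] for src in sources.values() if entity in src)
            seg = segment_of(any_row)
            for source_name in rank_order_by_segment.get(seg, list(sources)):
                row = sources.get(source_name, {}).get(entity)
                if row is not None and row.get(field) is not None:
                    integrated[entity] = row[field]
                    break   # highest-ranked source with the field present wins
        return integrated

Repeating this for each field of interest yields one value per field per unique row, possibly drawn from different sources within the same row.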

One or more steps of FIG. 1 may be repeated. Thus, in an alternate embodiment, the data elements selected are dependent upon the results of at least one iteration of steps 101-106. Thus, after step 106 is completed, the objectives and priorities for the data elements will also be identified by the line of business owner or function, which often would be the data owner or licensee. Next, objectives and priorities may be suggested by the invention based on the data elements or sources and the applicable data quality metrics. Next, objectives and priorities may be suggested by the invention based on previous results obtained from operation of the invention under similar circumstances. The process of FIG. 1 may then be repeated.

In addition, the invention of FIG. 1 has application in several markets, including, for example, the intelligence community (“IC”). The invention also embodies a special feature for the IC, which consists of many governmental agencies which obtain data from many different sources and maintain data in many different formats. The general information technology arena provides access to more sources of information today than ever before, yet IC agencies continue to resist sharing or acquiring much of this data because there is no objective way to measure the contribution new sources will make to an existing data set. In other words, “is the data that is being shared adding any real value, is it degrading my data, is it just taking up valuable space?”

Within the framework of the Common Terrorism Information Sharing Standards (CTISS), there are requirements established for a wide variety of sharing capabilities—everything from data format to metadata and XML standards. The missing standard is data quality, which largely determines the reliability and intelligent value of the data.

In order to enhance the environment for data sharing, quality measurements should be established within the CTISS in order to apply objective (not subjective) metrics and score the quality of data, creating a “Data Reliability Index.” Using this Data Reliability Index, agencies would know the absolute and relative value of each data source, its intelligent life and its predictive power. Any report that is produced can be traced to reveal the data provenance, audited to validate the relevance and scored for reliability to demonstrate why decisions were made.

The invention contains processes, methods and software that objectively measure data by the Data Assurance Dimensions. Recently, the Department of Justice adopted data quality as a policy, but, to date, they have not implemented the policy. The invention process and method, practiced with the anticipated result of the Data Reliability Index, will provide a critical tool for removing the obstacles to information sharing among the IC.

FIG. 4 illustrates a flowchart describing the application of the rules (step 106) of FIG. 1 to create integrated data. In this example, three data fields are examined: name (Step 201), address (Step 202) and income (Step 203). If the data of Steps 201, 202 and 203 is unique, i.e., only exists in one of the data sets (Data 1 or Data 2), then the data is entered into the integrated database. If the data is not unique, i.e., data is present in both Data 1 and Data 2, then the rules engine of Step 204 is used. The rules engine, the output of FIG. 1, rank orders the data of the one or more data sets to choose the optimal data set, which is then placed in the database of Step 205.

FIG. 5 illustrates a second embodiment of the invention. FIG. 5 is a flowchart describing the process of data filtering to improve the overall data quality of an information source. The data is first inputted into a database in step 501. Data can be inputted by any means, including manual entry, automated input of data from other databases or electronic documents, and the like. In step 502, the data is retrieved from the database and processed by a data hygiene engine (DHE). The DHE functions to verify that the data in the database is “good” data, and to filter out incorrect data from being transmitted to and included in the production database (step 505). Thus, the DHE applies some set of rules to determine whether this information is valid.

The DHE can examine a field, a group of fields within a row or a group of rows as a whole. FIG. 6 illustrates this concept.

First, the DHE can contain field rules. Field rules are rules the engine can completely execute looking only at a single field. For example, the field rule might be field type (integers, doubles/reals, dates, times, date times, strings or images). Alternatively, the DHE parameters may be limited to a specific, limited set of integers, doubles/reals, dates, times, date times, strings or images. Thus, the DHE uses the field rules to confirm that a field for number entry contains numbers, that fields approved for characters contain characters, and the like. Alternatively, the DHE may function to prevent duplicate data from being entered.

Field rules also include field value ranges. The field value ranges are a range of field types which are acceptable, such as an integer range, time range, date range, string range, and the like.
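A minimal sketch of DHE field rules is shown below: each rule looks at a single field and checks its type or that its value falls within an acceptable range. The field names and ranges are illustrative assumptions, not values taken from the specification.

    from datetime import date

    FIELD_RULES = {
        "age":        lambda v: isinstance(v, int) and 0 <= v <= 120,
        "income":     lambda v: isinstance(v, (int, float)) and v >= 0,
        "birth_date": lambda v: isinstance(v, date) and date(1900, 1, 1) <= v <= date.today(),
        "name":       lambda v: isinstance(v, str) and len(v.strip()) > 0,
    }

    def field_errors(row):
        """Return the names of fields in the row that violate a field rule."""
        return [f for f, rule in FIELD_RULES.items() if f in row and not rule(row[f])]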

The DHE may also be configured to have row rules. Row rules are rules the DHE executes to intermix two different fields, thus having multi-field constraints. Here, two or more data fields must be examined together to determine if the two or more data fields are correct or “good data.” The following examples of row rules are not limiting. In other words, whether a data field is acceptable may be dependent upon another data field. For example, if a person's age is entered as 40 years old, but the height is entered as 21 inches, both data fields would be flagged as “bad” information. Thus, the age range data field must correspond to an approved height range. Another example is matching zip codes with cities. Cities can have multiple zip codes, but to confirm that a zip code data field and city data field in a row are correct, the zip code will have to be compared with the city.
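A minimal sketch of row rules follows, mirroring the age/height and zip/city examples above. The threshold values and the zip-to-city lookup table are illustrative stand-ins for real reference data.

    ZIP_TO_CITY = {"10001": "New York", "60601": "Chicago"}   # illustrative only

    def plausible_height(row):
        # An adult's height should not fall in an infant's range.
        if row.get("age", 0) >= 18:
            return row.get("height_inches", 0) >= 48
        return True

    def zip_matches_city(row):
        expected = ZIP_TO_CITY.get(row.get("zip"))
        return expected is None or expected == row.get("city")

    ROW_RULES = [plausible_height, zip_matches_city]

    def row_errors(row):
        """Return the row rules (by name) that flag this row as 'bad' data."""
        return [rule.__name__ for rule in ROW_RULES if not rule(row)]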

Finally, the DHE may have data assurance management metrics (DAM) rules. DAM rules are data quality metric rules used to analyze groups of rows. The rows are grouped based on some external means for any reason desired by the customer or DHE programmer. For example, the rows may be grouped based on cities, location, numbers, sex, field type (integers, doubles/reals, dates, times, date times, strings or images) and the like. Once the rows are grouped, the metric is calculated for each group. If a computed group metric is beneath some critical threshold, the entire group is thrown out. The threshold is determined by the customer or DHE programmer.

DAM rules are very useful because an entity in a database could have multiple data entries or rows. For example, the original databases could have name and address data fields where a name has multiple address entries. A DAM rule might be that if the Completeness of the address data fields for a name is less than 80%, all rows for that name (entity, group) are bad data and should not be placed in the production database.

By applying DAM rules we can raise the overall metric for the entire data set. For example, group the data into G total groups, and compute the completeness score for each group. In addition, compute the completeness score for the entire dataset. Next, set a minimum threshold for completeness. Eliminate all groups that do not meet or exceed the threshold. The resulting dataset must either be empty or must have a completeness score greater than or equal to the threshold value. With this method we can improve the quality metrics of the data set.

This technique may be used with any of the data quality metrics and is not limited to completeness. Moreover, we may extend the technique to combine a plurality of data metrics together. For example, we may use two metrics: completeness and accuracy. Both of these are scored on the range [0,1]. We may set a combined rule such as: the sum of the squares of the completeness and accuracy metrics must exceed 0.8. Applying this rule to the groups eliminates some groups and leaves others. In this manner we create a combined constraint metric that we apply to the entire dataset.
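A minimal sketch of a DAM rule over groups of rows follows: rows are grouped by an arbitrary key, a completeness score is computed per group, groups below the threshold are discarded, and a combined completeness/accuracy constraint can be applied in the same way. The field names, grouping key, and thresholds are illustrative assumptions.

    from collections import defaultdict

    def completeness(rows, field):
        """Fraction of rows in which `field` is present and non-empty."""
        return sum(1 for r in rows if r.get(field) not in (None, "")) / len(rows)

    def filter_groups(rows, group_key, field, threshold=0.8):
        """Keep only the groups whose completeness for `field` meets the threshold."""
        groups = defaultdict(list)
        for r in rows:
            groups[r[group_key]].append(r)
        return [r for g in groups.values()
                if completeness(g, field) >= threshold for r in g]

    def combined_rule(rows, field, accuracy_score, limit=0.8):
        """Example combined constraint: completeness^2 + accuracy^2 must exceed `limit`."""
        return completeness(rows, field) ** 2 + accuracy_score ** 2 > limit

Because every surviving group individually meets the threshold, the filtered dataset as a whole is either empty or meets the threshold as well.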

Referring again to FIG. 5, the DHE is provided an acceptable range of values or numbers. Data which falls within this acceptable range is flagged as “good data” (step 503) and placed in the production database (step 505). If the data does not fall within the acceptable range, it is flagged as “bad data” (step 504) and subject to an additional review (step 506). The additional review can be conducted by a person, computer and/or additional software. Further, the result of the review can be to modify the data. For instance, the data can be corrected, after which it is accepted (step 507).

Further, a search engine query may be coupled with a DAM threshold. Given a plurality of data sources, a user queries for information and provides one or more minimum DAM constraints. For example, we may have 30 data sources with name/address combinations representing person entities. A user may request to select all names/addresses from the data sources where the DAM coverage metric is at least 85%. In this manner, the user is able not only to retrieve the requested information, but can set the reliability and/or confidence thresholds for the data sources. Thus, the user can be confident that the data returned was obtained from data sources of a particular quality.
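A minimal sketch of coupling a query with a minimum DAM constraint is shown below: data sources whose coverage metric falls below the user's threshold are excluded before the query runs. The source list, metric values, and predicate are illustrative assumptions.

    def query_with_dam(sources, predicate, min_coverage=0.85):
        """sources: list of {"name": ..., "coverage": ..., "rows": [...]} dicts.
        Only sources meeting the coverage threshold contribute rows."""
        qualified = [s for s in sources if s["coverage"] >= min_coverage]
        return [row for s in qualified for row in s["rows"] if predicate(row)]

    # Example: select all name/address rows for a given city from sources with
    # at least 85% coverage (S2 is excluded by the constraint).
    results = query_with_dam(
        sources=[
            {"name": "S1", "coverage": 0.91, "rows": [{"name": "A. Smith", "city": "Mobile"}]},
            {"name": "S2", "coverage": 0.78, "rows": [{"name": "B. Jones", "city": "Mobile"}]},
        ],
        predicate=lambda row: row["city"] == "Mobile",
    )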

The flowcharts, illustrations, and block diagrams of FIGS. 1 through 9 illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, electronic component, segment, or portion of code, which comprises one or more executable instructions for implementing the specified function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be understood that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The data metrics may be applied to better understand the data itself, rather than to draw conclusions about an underlying application. For example, we may analyze project data to measure the quality of the project related information. This analysis can make conclusions about the project data in isolation. The analysis does not need to draw conclusions about the quality of the projects themselves. The metrics provide a framework for understanding and analyzing the data, in isolation from the purpose behind the data. Conclusions may be drawn about the data without implicating the underlying utility or sources of the data.

Because of this, these processes and methods may be applied to anonymized data. Anonymized data is data that is run through a mapping process to transform it from one form to another while preserving certain aspects of the data. For example, we may use an epimorphic mapping to map names in a database to numbers. Thus, we may replace the name ‘Brian’ with the number 2, ‘Rick’ with the number 7, ‘Larry’ with the number 9, ‘John’ with the number 2, etc. In this sense we have anonymized the data so we no longer have personally identifiable information residing in the dataset.

As an alternative to the epimorphic map, we may use a map that preserves ordering. Continuing with the example, we map ‘Brian’ to 2, ‘Rick’ to 7, ‘Larry’ to 5, and ‘John’ to 4. With this mapping, not only have we removed personal name information, we have preserved the lexicographical ordering of the names. Sorting the names we have ‘Brian’, ‘John’, ‘Larry’, and ‘Rick’. Similarly, their corresponding numbers are 2, 4, 5, and 7. Thus, we can sort the mapped list and know that the relative ordering is the same as if we had sorted the original list.

Using these techniques, we can create anonymizing routines that preserve one or more aspects of the original data. The data metrics may be applied to the resulting, anonymized data, and will compute the exact same scores as we would have found if we computed the metrics directly from the original data.
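A minimal sketch of an order-preserving anonymization map follows: each distinct name is replaced by its rank in sorted order, so lexicographic ordering survives the mapping while the names themselves are removed, and metrics such as completeness computed on the mapped column equal those computed on the original column. The sample names are illustrative.

    def order_preserving_map(values):
        """Build a mapping from each distinct value to its 1-based sorted rank."""
        distinct = sorted(set(v for v in values if v is not None))
        return {v: rank for rank, v in enumerate(distinct, start=1)}

    names = ["Rick", "Brian", None, "Larry", "John"]
    mapping = order_preserving_map(names)
    anonymized = [mapping.get(n) for n in names]

    # Sorting the anonymized numbers yields the same relative ordering as sorting
    # the original names, and missing values remain missing, so a completeness
    # score is unchanged by the mapping.
    completeness = sum(1 for v in anonymized if v is not None) / len(anonymized)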

This is useful because we can perform the full analysis using these metrics without ever needing to review sensitive information. The data source owner can perform these mappings on the original data to transform it to anonymized data. The resulting anonymized data may be shared externally without needing to share the mappings used to create the anonymized data. If the mappings use an epimorphic map, it is mathematically impossible for the anonymized data to be reverse engineered to obtain the original data. We may use the externally shared data to perform a data assurance management analysis, compute metrics, etc. These metrics have the same value as we would have found if we had worked with the original data. Thus, we may perform the entire data quality analysis without ever needing to review the original data.

In the drawings and specification, there have been disclosed typical illustrative embodiments of the invention and, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the invention being set forth in the following claims.

It should be apparent from the foregoing that an invention having significant advantages has been provided. While the invention is shown in only a few of its forms, it is not just limited to those described but is susceptible to various changes and modifications without departing from the spirit thereof.

Example 1

Example 1 describes the process of combining three data sources along with their cost and number of unique records, to examine the cost and number of unique records for each combination.

TABLE 1. Example data sets showing the cost of each data set and the number of unique records provided in the data set.

Data Source    Cost    Unique Records
A              $10     15,000
B              $15     25,000
C              $20     10,000

There are seven possible combinations: (A), (B), (C), (AB), (AC), (BC), and (ABC). For each combination the total cost and the number of unique records needs to be calculated. The total cost for a particular combination is simply the sum of the costs of each data set making up the combination, where these costs are given in Table 1.

However, the total number of unique records cannot be determined from Table 1 because there may be overlap between the data sources that comprise the combination. For example, the combination (AB) has at least 25,000 unique rows because B alone has this many unique rows. Further, there can be at most 40,000 unique rows if there is no overlap between A and B. The combination may have any number of unique records from 25,000 to 40,000.

Table 2 presents the cost and number of unique records for each combination of data sources from Table 1. The number of unique records was computed by creating a sample set and counting the actual number of unique records that result from the combination. However, generating a different data sample would result in a different number of unique rows in the combinations.

TABLE 2. The cost and number of unique records for each combination of data sources from the previous table.

Combination    Cost    Unique Records
A              $10     15,000
B              $15     25,000
C              $20     10,000
AB             $25     35,510
AC             $30     22,037
BC             $35     30,036
ABC            $45     36,102
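A minimal sketch of how a table like Table 2 could be assembled is shown below: for every non-empty combination, the cost is the sum of the member costs, and the unique-record count is taken from sample sets of record identifiers, since overlap cannot be read off Table 1. The sample sets here are stand-ins and will not reproduce the Table 2 counts exactly.

    from itertools import combinations

    costs = {"A": 10, "B": 15, "C": 20}
    # Stand-in sample sets; real samples would hold the actual record keys.
    samples = {"A": set(range(0, 15000)),
               "B": set(range(10000, 35000)),
               "C": set(range(30000, 40000))}

    for size in range(1, len(costs) + 1):
        for combo in combinations(sorted(costs), size):
            total_cost = sum(costs[s] for s in combo)
            unique = len(set().union(*(samples[s] for s in combo)))
            print("".join(combo), total_cost, unique)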

The results of combining these data sets are plotted in FIG. 8. First, it is important to note that the cost axis (x-axis) runs from high to low. This is because low cost is considered beneficial in this case, and we want the x- and y-axes to run from ‘bad’ to ‘good’. In FIG. 8, the gray boxes indicate regions that are definitively worse with respect to a given point. Points lying within a gray box have worse values on both the x- and y-axes than some other point.

For example, the point C lies within the box for A. Examining Table 2, C has both a worse value for the y-axis (number of unique rows: C (10,000) worse than A (15,000)) and a worse value for the x-axis (cost: C ($20) worse than A ($10)). Thus, data set A would always be chosen over data set C. Data set C would not be selected by itself when there is another data set that is better in all aspects.

Some points in FIG. 8 are not comparable to each other. For example, points B and A are not comparable because neither point lies within the gray box of the other. Looking at Table 2, point A is better along the x-axis (cost: A ($10) better than B ($15)), but B is better along the y-axis (number of unique records: B (25,000) better than A (15,000)). In this case it cannot be determined whether A is better than B or vice-versa. Points A and B are simply not comparable.

Examining FIG. 8, three of the data combinations are definitively worse than another combination. Combination BC is worse than AB, combination AC is worse than both B and AB, and C is worse than both A and B. To optimize both cost and the number of unique records, BC, AC, or C would never be chosen. In each case there is another data combination that is definitively better. Thus, these data sets are eliminated from further consideration.

Eliminating the definitively worse combinations leaves four remaining points (represented as dark circles in FIG. 8). These points are all incomparable to each other. There is no mathematical way to determine which of these is the best. These points form a data frontier that creates a boundary for the data combinations.

Although the unique optimum cannot be determined among the points on the frontier, the tradeoffs between the points can be examined. For example, point A and point B are compared, showing that for a marginal cost increase of $5 (B ($15) − A ($10)) one gains the marginal benefit of 10,000 unique rows (B (25,000) − A (15,000)). An entity or person may consider this tradeoff desirable, or they may consider it not worth the price. However, this preference cannot be determined based on the data alone. At this point the preferences of the data consumer are examined to determine the optimal tradeoff point along the frontier.

In two dimensions (cost v. number of unique rows) the points on the frontier can be sorted. The points can be sorted by cost or by the number of unique rows. Both sortings produce essentially the same list, in opposite order. Sorting cost from best to worst gives: A, B, AB, and ABC. Sorting number of unique rows from best to worst gives: ABC, AB, B, and A. This sorting was then used to construct a table showing the marginal cost and marginal benefit for moving from one point to the next. See Table 3.

The cost-benefit analysis is useful for determining the right data combination for a particular data consumer. By identifying the frontier and eliminating those points not on the frontier, the analysis is focused on the frontier points and a cost-benefit analysis for each frontier point is performed. The preferences of the individual data consumer will determine where on the cost-benefit curve (the frontier) the best fit lies.

TABLE 3. The frontier data combinations are presented with the marginal cost (Cost Increase) and marginal benefit (Additional Unique Records). The final column shows the ratio of the marginal benefit to the marginal cost.

Initial/Final Point    Cost Increase    Additional Unique Records    Additional Unique Records/Cost Increase
A -> B                 $5               10,000                       2,000
B -> AB                $10              10,510                       1,051
AB -> ABC              $20              592                          29.6

Table 3 shows the results of examining the marginal cost and marginal benefit between points on the frontier. Starting from A and moving to B, a cost increase of $5 is incurred with the benefit of 10,000 unique records. This transition produces 2,000 additional records per $1 spent. Moving from B to AB, the cost increases by $10 with 10,510 additional unique records. However, for this transition only 1,051 additional records are received per $1 spent. Finally, in moving from AB to ABC, an additional $20 in cost is incurred to obtain 592 additional unique records. This transition produces only 29.6 new records per $1 spent.
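The Table 3 computation can be sketched in a few lines: sort the frontier points by cost, then report the marginal cost, marginal benefit, and records-per-dollar ratio for each transition between adjacent points. The tuples below simply restate the frontier values from Table 2.

    frontier = [("A", 10, 15000), ("B", 15, 25000), ("AB", 25, 35510), ("ABC", 45, 36102)]

    for (name0, cost0, rec0), (name1, cost1, rec1) in zip(frontier, frontier[1:]):
        d_cost = cost1 - cost0        # marginal cost of moving to the next point
        d_records = rec1 - rec0       # marginal benefit in unique records
        print(f"{name0} -> {name1}: +${d_cost}, +{d_records} records, "
              f"{d_records / d_cost:.1f} records per $1")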

The transitions between points on the frontier are examined in order to understand the available tradeoffs. However, only the preferences of the data consumer can determine which of the points along the frontier is best suited.

Formalization of the Data Frontier

Examination of an optimization problem occurs with an arbitrary number of dimensions. Each dimension is self comparable, meaning that for a particular dimension low (high) values are less preferred than high (low) values. For a particular dimension it does not matter whether low or high values are preferred, only that there exists an ordered preference, and as we move from one value to another we can determine if the direction of movement is preferred or not.

The symbols ≺ and ≻ are used to indicate that values are less or more preferred. For example, examining the number of unique records from Table 2, A ≺ B, indicating that B is more preferred than A. Similarly, examining cost, B ≺ A because the cost for A ($10) is more preferred than the cost for B ($15). We use ≺ instead of < to avoid confusion that we are comparing values in the traditional sense. In value terms, A < B. In preference terms, B ≺ A.

Each of the dimensions under consideration is ordered and plotted according to preferences for that dimension. Hence, number of unique records would typically be sorted from low values to high values, whereas cost would typically be sorted from high values to low values (similar to FIG. 8).

Next, points in this n-dimensional space are compared. Each point (data combinations from the previous example) represents a particular combination of values of each of the n dimensions. For instance, for the two points P1 and P2, the value of each dimension of P1 is represented as P1_t, where t = 1, 2, 3 . . . n, and the value for each dimension of P2 is P2_t. P1 is considered definitively better than P2 if and only if P1 is preferred to P2 in every dimension. If P1 is definitively better than P2, then P1 ≻ P2 or P2 ≺ P1.

If P1 is preferred to P2 in every dimension except one in which they are equal, then imagine in FIG. 8 that there was some other point D that was exactly on the boundary of the gray box for point A. For instance, say D had a cost of $10 (exactly the same as A), but the number of unique records was 10,000 (less than A). In this case A is still preferred over D because A is better in all other respects. In general, if points A and D are equal in some number of dimensions, but A is preferred in all other dimensions, then A ≻ D. Because points may have some of the same values, the symbol ≼ is used, meaning ‘less preferred or equal to’. If two points have the same value for every dimension, the symbol ≈ is used. Thus if A ≈ D, then A and D must have the same value for every dimension.

Mathematically,

A ≺ B ⟹ (A_t ≼ B_t ∀ t = 1, 2, . . . , n) ∧ (A ≉ B)  (1)

where n is the number of dimensions under consideration. Equation 1 is read ‘A less preferred than (≺) B implies that (⟹) A_t is less preferred or equal to (≼) B_t for every (∀) t = 1, 2, . . . , n and (∧) A is not equivalent to (≉) B’.

Similarly,

A ≻ B ⟹ (A_t ≽ B_t ∀ t = 1, 2, . . . , n) ∧ (A ≉ B)  (2)

A ≈ B ⟹ (A_t ≈ B_t ∀ t = 1, 2, . . . , n)  (3)

The concept of not-equivalent-to (≉) arises from negating equation 3:

A ≉ B ⟹ ∃ i such that A_i ≉ B_i  (4)

This is read ‘A not equivalent to (≉) B implies (⟹) there exists (∃) an i such that A_i is not equivalent to (≉) B_i’. Again, the preference and equivalence of the overall points is determined by examining the preference of each individual dimension comprising the point.

Concepts such as ‘less preferred or equivalent to’ and ‘more preferred or equivalent to’ are defined:

A ≼ B ⟹ (A ≺ B) ∨ (A ≈ B)  (5)

A ≽ B ⟹ (A ≻ B) ∨ (A ≈ B)  (6)

where the symbol ∨ (read ‘or’) is a logical OR. Essentially, the first states that A ≼ B means that either A ≺ B or A ≈ B. Similarly, A ≽ B means that either A ≻ B or A ≈ B.

Examine two points for an individual dimension (for example cost). Let a and b represent values (not necessarily different) for this dimension. In order for the dimension to be validly ordered, every two values for the dimension must obey one of the following relations:

a ≺ b,  a ≻ b,  or  a ≈ b  (7)

In other words, when examining one dimension alone, for any two values for the dimension, a and b, a is preferred to b, b is preferred to a, or a is equivalent to b. For example, when looking at cost alone, for any two values of cost it must be the case that either one value is preferred to the other, or they are the same.

Since there are only three possibilities for comparing an individual dimension, there are exactly four possible outcomes from equations 1-3 when comparing two points:

A ≺ B,  A ≈ B,  A ≻ B,  or  A and B not comparable  (8)

For two points A and B, A may be less preferred than B, A may be equivalent to B, A may be preferred to B, or A and B may not be compared. Two points will be not comparable when one point has a preferred value for one dimension while the other point is more preferred in another dimension. Any two points must fall into exactly one of these four categories. Points cannot simultaneously obey two or more.

The properties from 8 are sufficient to form a poset. A poset is a partially ordered set. Mathematically, a poset is formed on a set when there exists a binary relation between members of the set such that

reflexivity: a ≼ a  (9)

antisymmetry: (a ≼ b) ∧ (b ≼ a) ⟹ a ≈ b  (10)

transitivity: (a ≼ b) ∧ (b ≼ c) ⟹ a ≼ c  (11)

where a, b and c are all members of the set.

Each of these relations is met based on equations 1-3 (along with 4-6, which arise from 1-3). First, the reflexivity relation is met from equation 5 by comparing A with itself:

A ≼ A ⟹ (A ≺ A) ∨ (A ≈ A)  (12)

If A ≼ A, then either A ≺ A (which cannot be true from equation 1) or A ≈ A (which is true from equation 3). Since A ≈ A, A ≼ A.

For the antisymmetry relation, we examine two points A and B such that A ≼ B and B ≼ A (equivalently, A ≽ B). From equations 5 and 6:

A ≼ B ⟹ (A ≺ B) ∨ (A ≈ B)  (13)

A ≽ B ⟹ (A ≻ B) ∨ (A ≈ B)  (14)

These are both satisfied if A ≈ B. It must also be shown that it is impossible to satisfy both when A ≉ B. To find this, assume that A ≉ B. From 13, if A ≉ B then A ≺ B. From 14, if A ≉ B then A ≻ B. However, if the dimensions underlying the points are well ordered, then any two points must meet exactly one of the criteria from equation 8. Thus, if A ≉ B, then two of the criteria from 8 are simultaneously satisfied: A ≺ B and A ≻ B. This is impossible, so the initial assumption that A ≉ B is false, which means that A ≈ B.

Finally, the transitivity relation states that if A ≼ B and B ≼ C then A ≼ C. This is broken into four cases: A ≈ B and B ≈ C; A ≈ B and B ≺ C; A ≺ B and B ≈ C; and A ≺ B and B ≺ C.

Case 1: A ≈ B and B ≈ C

If A ≈ B then from equation 3, A_t ≈ B_t ∀ t = 1, 2, . . . , n. Similarly, if B ≈ C then B_t ≈ C_t ∀ t = 1, 2, . . . , n. But for each dimension, if A_t ≈ B_t and B_t ≈ C_t then A_t ≈ C_t. If A_t ≈ C_t ∀ t = 1, 2, . . . , n then A ≈ C, and from equation 5, A ≼ C.

Case 2: A ≈ B and B ≺ C

If A ≈ B then from equation 3, A_t ≈ B_t ∀ t = 1, 2, . . . , n. For this case B ≺ C. But since A ≈ B, this means that A ≺ C and thus A ≼ C.

Case 3: A ≺ B and B ≈ C

If B ≈ C then from equation 3, B_t ≈ C_t ∀ t = 1, 2, . . . , n. For this case A ≺ B. But since B ≈ C, this means that A ≺ C and thus A ≼ C.

Case 4: A ≺ B and B ≺ C

If A ≺ B then A_t ≼ B_t ∀ t = 1, 2, . . . , n. Also, if B ≺ C then B_t ≼ C_t ∀ t = 1, 2, . . . , n. But for an individual dimension, if A_t ≼ B_t and B_t ≼ C_t then A_t ≼ C_t. Thus, A_t ≼ C_t ∀ t = 1, 2, . . . , n, so A ≺ C and thus A ≼ C.

Thus, as shown in equations 1-6, the relations for a poset are satisfied. From this, the results and analysis of posets may be used and applied to multidimensional optimization. FIG. 9 shows a Hasse diagram, where each element of the set is represented by a point and lines are drawn between points indicating how they compare. If a and b are points in a Hasse diagram where a ≺ b, then a and b are connected if there is no other point c such that a ≺ c and c ≺ b. FIG. 9 provides a Hasse diagram for the preferences shown in FIG. 8.

The Hasse diagram provides an easy mechanism for finding the frontier. In general, any point that is at the top of a Hasse diagram is on the frontier. These points are either not comparable to any other point (ABC in FIG. 9) or they are preferred to some other set of points. Any point that is connected upward to another point (C, AC, and BC in FIG. 9) is not on the frontier.
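A minimal sketch of the point comparison formalized above is given below. Each point is represented as a tuple of per-dimension values already oriented so that larger means more preferred (cost is negated for this reason); the function returns one of the four outcomes of equation 8. The example values restate Table 2 entries.

    def compare(A, B):
        """Return 'A<B', 'A>B', 'A~B' (equivalent) or 'incomparable'."""
        a_never_better = all(a <= b for a, b in zip(A, B))
        b_never_better = all(b <= a for a, b in zip(A, B))
        if a_never_better and b_never_better:
            return "A~B"            # equivalent in every dimension
        if a_never_better:
            return "A<B"            # A is less preferred than B
        if b_never_better:
            return "A>B"            # A is more preferred than B
        return "incomparable"       # each point wins in some dimension

    # (negated cost, unique records) so that larger is better in both dimensions.
    print(compare((-10, 15000), (-15, 25000)))   # A vs B: incomparable
    print(compare((-20, 10000), (-10, 15000)))   # C vs A: 'A<B' (C less preferred)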

Example 2

An analysis of five different sources applying the dimension of Completeness is below. Each source file was selected on the same set of geographic criteria. For purposes of this analysis, the five elements were chosen by the customer: Name, Address, Phone, Income Level and Presence of Credit. Baseline sources are selected based solely on the completeness dimension; no other factors were considered for purposes of this report. The analysis matrix is generated by comparing each of the four remaining sources with the baseline source, element by element.

Source    Total Number of Records    Number of Duplicate Records    Number of Unique Records    Percentage of Unique Records    Price per Thousand Records
S1        4 million                  234,987                        3,765,013                   94.13%                          $8
S2        1.4 million                89,789                         1,310,211                   93.59%                          $5
S3        4.7 million                126,430                        4,573,570                   97.00%                          $4
S4        4.9 million                908,950                        3,991,050                   81.45%                          $9
S5        3.2 million                276,741                        2,923,259                   91.35%                          $6

In this instance, a consistent ID was not provided. A deduplication process based solely on Name (using character string matching only) was conducted.

It was determined that Source 3 (S3) had the largest number of unique records among all five sources and also had the largest percentage of unique records with 97.31%. Therefore, S3 was set as the baseline source in the name field only. Subsequent comparisons, using the same methodology, were completed on each source, creating a different baseline source for each element.

          Name Unique    Address                  Phone                    Income                   Credit History
Source    Records        Complete      %          Complete      %          Complete      %          Complete      %
S1        3,765,013      3,332,156     83.30%     3,388,512     84.71%     1,940,475     48.51%     2,055,990     51.40%
S2        1,310,211      1,005,394     71.81%     1,143,814     81.70%     736,339       52.60%     888,323       63.45%
S3        4,573,570      3,185,900     67.79%     1,765,560     37.57%     2,267,379     48.24%     2,446,037     52.04%
S4        3,991,050      3,197,478     65.25%     3,678,360     75.07%     1,912,458     39.03%     2,112,144     43.10%
S5        2,923,259      2,515,221     78.60%     2,148,595     67.14%     2,022,895     63.22%     1,543,481     48.23%

B. Element Profile

After the benchmark file has been established, the next step compares each source of data, element by element, beginning the benchmarking with “Name,” then identifying the duplicates to determine an accurate actual contribution level of each file. For example, the table below shows that there are a total of 4,573,570 unique records in the baseline source for names (Source 3), and a comparison with Source 1 (S1) shows that S1 contributed 798,111 unique records. The total of these two becomes the benchmark for the next iteration of comparison (Source 2, Source 4, etc.). This table illustrates the results of an element-by-element comparison of the five sources.

Element           Source    Value        Percentage
Address           S1        3,332,156    83.30%
                  S2        11,457       0.82%
                  Total     3,343,613
                  S1        3,332,156
                  S3        201,574      4.29%
                  Total     3,533,730
                  S1        3,332,156    83.30%
                  S4        203,391      4.15%
                  Total     3,535,547
                  S1        3,332,156    83.30%
                  S5        139,201      4.35%
                  Total     3,471,357
Name              S3        4,573,570    97.31%
                  S1        798,111      19.95%
                  Total     5,371,681
                  S3        5,371,681
                  S2        16,745       1.20%
                  Total     5,388,426
                  S3        5,388,426
                  S4        312,388      6.38%
                  Total     5,700,814
                  S3        5,700,814
                  S5        238,318      7.45%
                  Total     5,939,132
Phone             S4        3,678,360    75.07%
                  S1        678,521      16.96%
                  Total     4,356,881
                  S4        4,356,881
                  S2        7,132        0.51%
                  Total     4,364,013
                  S4        4,364,013
                  S3        65,560       1.39%
                  Total     4,429,573
                  S4        4,429,573
                  S5        143,722      4.49%
                  Total     4,573,295
Income            S3        2,267,379
                  S1        298,567      7.46%
                  Total     2,565,946
                  S3        2,565,946
                  S2        15,897       1.14%
                  Total     2,581,843
                  S3        2,581,843
                  S4        267,231      5.45%
                  Total     2,849,074
                  S3        2,849,074
                  S5        131,367      3.93%
                  Total     3,080,441
Credit History    S3        2,446,037    52.04%
                  S1        290,755      7.27%
                  Total     2,736,792
                  S3        2,736,792
                  S2        10,725       0.77%
                  Total     2,747,517
                  S3        2,747,517
                  S4        201,678      4.12%
                  Total     2,949,195
                  S3        2,949,195
                  S5        67,208       2.29%
                  Total     3,116,403

C. Summary Findings and Recommendations

For this hypothetical example analysis, the heaviest weight was given to the Income element. Based on these results, Source 3 (S3) should be considered the overall baseline file with respect to completeness. However, relatively low contributions in the address and phone elements force more augmentation than might normally be necessary. The recommendation is that the completeness percentages be leveraged to negotiate a more favorable pricing package in future buys.

Below is a matrix depicting a rank order of the data sources based solely on completeness and with a weight placed on Income and Credit History.

File    Name         % Lift     Address      % Lift     Phone        % Lift     Income       % Lift     Credit History    % Lift
S3      4,573,570               201,574      4.29%      65,560       1.39%      2,267,379               2,446,037
S1      798,111      19.90%     3,332,156               678,521      16.96%     298,567      7.46%      290,755           7.27%
S4      312,388      7.23%      203,391      4.15%      3,678,360               267,231      5.45%      201,678           4.12%
S2      16,745       1.20%      11,457       0.82%      7,132        0.51%      15,897       1.14%      10,725            0.77%
S5      136,783      4.54%      139,201      4.35%      143,722      4.49%      131,367      3.93%      67,208            2.29%

Further, based solely on completeness and actual contribution, Source 2 and Source 5 are deleted. In terms of unique names provided, Source 2 provides 1.2% lift while Source 5 provides just over 4.5%. The additional elements analyzed provide even more evidence that the sources are not providing substantive contributions. Neither source provides substantial lift in address, phone, income, or presence of credit history.

Based on the data costs provided by the customer, the elimination of Source 2 and Source 5 will result in savings. An important caveat to note is that Source 5 is the more expensive of the two eliminated files, at a cost of $6 per thousand, but the file does contribute over 131,000 unique income records and more than 67,000 unique presence of credit records. Based on customer value and retention models, the client may determine there is a value to retaining S5.

Example 3

Scientific or environmental data, especially data collected on a massive scale by persons who are not professional scientists, may also be the subject of data assurance management. In this example, data assurance management was applied to the collection, processing, storage and later use of large quantities of scientific and environmental data.

The data collector was a company of science and environmental data collectors (“SEDC”) engaged to collect and report data for air, soil, tar ball and water sampling activities being conducted as a result of a massive off-shore oil spill, where there were approximately 2,500 data collection points spanning a geographic area along the Gulf Coast from New Iberia, La. to Tallahassee, Fla. Over 20 teams consisting of approximately 800 people collected and reported data samples every fifteen (15) minutes.

SEDC also managed several other data streams related to the project:

-   Off-Shore Vessel AreaRAE Data (71 AreaRAE Units)
-   Off-Shore Vessel Passive Dosimeter Badge Results (120 samples per day)
-   On-Shore Staging Area Passive Dosimeter Badges (120 samples per day) and
-   SCAT Tar Ball Sampling Data.

All incoming field data, whether collected on paper or via the MC-55 EDA, was imported or entered into SEDC's local Scribe database. Appropriate data was then pushed to the Environmental Protection Agency's (EPA) Scribe.NET offsite enterprise database to share with local and federal authorities. SEDC responded to operational and response data requests using on-site Scribe databases.

Numerous data streams and data sources needed the invention of FIG. 4 to be implemented and the resulting data integrated into a data warehousing solution to avoid disparate silos of information and data. The oil company, the EPA, the US Coast Guard, the Department of Homeland Security, and other government regulators cannot make enterprise-level decisions because the data does not give them the complete picture—environmental readings are in multiple databases, worker access and job roles are in other databases, etc.

The objective was to have all of these data streams, including all foundation or provenance data, processed under the invention and installed in a single data repository or data warehouse.

The demands of collecting, processing and using the data required a new architecture of a single data warehouse for data capture and management that would:

-   Allow users to access appropriate data to make queries, run reports and make decisions on a near real-time basis
-   Allow decision makers to determine the reliability and absolute value of each data source
-   Allow decisions to be made at an enterprise level rather than through silos of data
-   Allow companies to continue to implement their best practice data collection efforts without interrupting their process, while still being able to make data contributions to the warehouse
-   Create a flexible environment that will allow any reporting or mapping tool to utilize all of the permissible data

Individual companies and reporting tools, such as EQuIS, needed to be able to continue their best practice of analytics and reporting without having to adapt to new data formats or parameters.

The implementation of data assurance management and the design of the data warehouse were intended to ensure the flexibility that would enable reporting tools (EQuIS), appropriate agencies (EPA), and companies (the oil spill company) to have the ability to see permissible data and make decisions without changing their current practice or architecture.

The needs of this SEDC project, as embodied in the invention—data assurance management—provided for a uniform data model for all data regardless of the data source or data collector. This allowed easier retrieval and analysis of information from multiple reporting tools as opposed to multiple variations of proprietary data models.

When importing from several data sources, a single data warehouse, introduced into the process during and after the implementation of data assurance management processes and methods, provided the ability to verify information across different sources for errors, inconsistencies and duplication. Information was refined and data installed into a single repository, where data could then be verified, normalized and assured of its integrity.

Data warehouses that store multiple sources under one repository are assured of control under a single handler, allowing for a more secure infrastructure. In addition, since a data warehouse functions separately from operational systems (reporting tools, access applications, etc.), it allows fast retrieval times of data without causing delay.

A single repository of data across multiple sources allowed for unique trend analysis and reporting where data is analyzed from all sources as opposed to trends found in individual systems.

Finally, a central data warehouse ensured properly mandated Information Assurance, assuring the handling of the data with regards to:

-   Confidentiality—Assurance that information is not disclosed to unauthorized individuals, processes or devices
-   Integrity—Quality with regards to the logical correctness, consistency and reliability of the information as well as assurance against unauthorized modification or destruction
-   Availability—Timely, reliable access to data and information for authorized users as it is needed
-   Authentication—Security measures to verify an individual's authorization to receive specific information
-   Authenticity—The necessity to ensure information is not only genuine but, in the case of duplicates, establishes a means of identifying a master record
-   Non-repudiation—Assurance that a sender of data is provided with proof of delivery and the recipient is provided with proof of the sender's identity, so neither party can deny delivery or receipt of a transaction.

1. A data filtering method comprising: receiving data from a database; determining if the data is valid by utilizing rules in a data hygiene engine, wherein the rules are: (1) at least one rule selected from the group consisting of: field rules, row rules, groups of rows rules, and combinations thereof, and (2) at least one data assurance management metrics rule, wherein the data assurance management metrics rules are data quality metric rules used to analyze groups of rows based on user requirements, wherein data quality metrics are calculated for groups of rows and compared to a threshold; filtering invalid data; and outputting a production database without the invalid data.

2. The method of claim 1, wherein the field rules are related to only one field.

3. The method of claim 1, wherein the field rules are field type or field value ranges.

4. The method of claim 1, wherein the row rules are rules to intermix two different fields.

5. The method of claim 1, wherein the groups of rows rules group rows, calculate metrics for each group of rows, and compare the metrics to a threshold.

6. The method of claim 5, wherein an entire group is eliminated if the calculated metric is below the threshold.

7. The method of claim 5, wherein the threshold is predetermined by a user.

8. The method of claim 1, wherein a data assurance management metric is used to group the data into multiple groups to compute a score for the entire dataset.

9. The method of claim 8, further comprising comparing the score to a threshold and eliminating all groups that do not meet the threshold.

10. The method of claim 1, wherein a data assurance management metric is used to group the data into multiple groups and compute a completeness score for the entire dataset.

11. The method of claim 10, further comprising comparing the completeness score to a minimum threshold and eliminating all groups that do not meet the threshold.

12. The method of claim 1, wherein a plurality of data assurance management metrics are used together to group the data into multiple groups and compute a score for the entire dataset.

13. The method of claim 12, further comprising comparing the completeness score to a threshold and eliminating all groups that do not meet the threshold.

14. The method of claim 13, wherein the threshold is determined by a rule related to the plurality of data assurance management metrics.

15. The method of claim 1, wherein an acceptable range of values or numbers is provided to the data hygiene engine.

16. The method of claim 1, wherein invalid data is subject to further review before being filtered.

17. The method of claim 1, wherein the invalid data is modified to create valid data that is then considered valid.

18. The method of claim 1, further comprising running a search engine query coupled to a threshold for the at least one data assurance management metrics rule.

19. The method of claim 18, further comprising setting reliability thresholds.

20. The method of claim 18, further comprising setting confidence thresholds.